R workflow
Agenda
- R & RStudio Workflow
- Quarto
- Getting Started
Learning objectives
By the end of the lab, you will be able to …
- set up a reproducible workflow using R and RStudio
- familiarize yourself with a dataset using R and RStudio
- create a reproducible report using Quarto
R & RStudio Workflow
Replication
The guiding principle for workflow.
A workflow of data analysis is a process for managing all aspects of data analysis.
Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow.
Source: (Long 2008)
Steps in a workflow
| Step | Note |
|---|---|
| Set up | Systematic organization of the project and project files. |
| Familiarize self with data | Skipping takes more time in the long run. |
| Process data | Takes the MOST time. |
| Run analyses | What people THINK takes the most time. |
| Present results | What people (wrongly) think does not take time. |
File types
There are many file types, but these are key to an R & RStudio workflow (and likely new to you):
| Extension | Description |
|---|---|
| .Rproj | RStudio project file (keeps project settings). |
| .R | R scripts store a sequence of R commands (code) that can be run all at once or line by line. |
| .qmd | Quarto Markdown creates reproducible documents that contain a combination of text, code, and output. |
| .RData (or sometimes .rda) | These files store R objects, such as data frames, so they can be reloaded later. |
File names
File names should be:
- machine-readable
- human-readable
- compatible with default ordering
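As a small illustration of the default-ordering point, the sketch below compares unpadded and zero-padded numeric prefixes. The file names are hypothetical, and `method = "radix"` is used only so the sort order does not depend on your computer's locale.

```r
# Hypothetical script names with unpadded numeric prefixes
unpadded <- c("1_setup.R", "2_clean.R", "10_models.R")

# Character sorting compares text, not numbers, so "10" lands before "1_"
sort(unpadded, method = "radix")

# Zero-padding the prefix keeps the files in their intended order
steps  <- c(1L, 2L, 10L)
labels <- c("setup", "clean", "models")
padded <- sprintf("%02d_%s.R", steps, labels)
sort(padded, method = "radix")
```

Names like these are also machine-readable (no spaces, predictable pattern) and human-readable (the label says what the script does).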
RStudio projects
Create an RStudio project for each data analysis project.
It supports an organized and reproducible workflow, cleanly separated from all other projects that you are working on. Everything you need in one place:
- local data files to load into RStudio.
- scripts to edit or run in bits or as a whole.
- outputs to save (plots and cleaned data).
Filepaths
Adopting a project-based workflow avoids changing file paths.
ABSOLUTE FILE PATHS
Department of Sociology
Unit 17100, 17th Floor, Ontario Power Building
700 University Ave., Toronto, ON M5G 1Z5
C:\Users\Pepin\GitHub\SOC6302\scripts
RELATIVE FILE PATHS
Take the left side elevators to the 17th floor.
Go through the double doors and take a right.
First door on your left.
here("scripts")
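To make the contrast concrete, here is a minimal sketch of building a relative path in R. The file name `01-setup.R` is hypothetical, and the `here` call runs only if that package is already installed.

```r
# A relative path is built from folder names, not from a drive letter
rel_path <- file.path("scripts", "01-setup.R")  # hypothetical file name
rel_path

# here() builds the same kind of path, anchored at the project root,
# so it keeps working when the project folder moves to another computer
if (requireNamespace("here", quietly = TRUE)) {
  here::here("scripts", "01-setup.R")
}
```

Passing folder and file names as separate quoted strings lets R insert the right separator for you.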
Tour: RStudio Panes
Sit back and enjoy the show!
- open R script or Quarto document
- Environment (data values, functions)
- Plots, Help, Viewer
- Console – typing
Tour recap: Panes
There are four key regions or “panes” in the interface:
Source pane: where you can edit and save R scripts or author computational documents like Quarto and R Markdown.
Console pane: where you can type and run short interactive R commands.
Environment pane: displays temporary R objects created during that R session.
Output pane: displays the plots, tables, or HTML outputs of executed code along with files saved to disk.
Console pane: best for exploring.
Source pane: best for documenting.
R-scripts and Quarto
Open RStudio, click the dropdown arrow next to the “New File” icon, and then choose “R Script” or “Quarto Document.”
Alternatively, press Ctrl + Shift + N to open a new R script.
An R script is a file that records the code you tell R to run, along with your comments on it.
Blank slate
Clear the memory at every restart of RStudio by turning off the automatic saving of your workspace and .RData files when you quit RStudio. This is important for reproducibility, debugging, and avoiding littering your computer with unnecessary files.
Set this via:
- Tools > Global Options.
- Uncheck “Restore .RData into Workspace at Startup”.
- Choose “Never” on the “Save workspace to .RData on exit”.
- Click “Apply” and “OK”.
Comprehensive R Archive Network (CRAN)
CRAN is like an App Store for R. It hosts R packages, documentation, and source code contributed by users worldwide. It is moderated (i.e., packages must pass quality checks before they are accepted), which makes it reliable.
R users can easily install, update, and share R packages using install.packages().
Packages
R comes with basic tools, but packages extend the capabilities of base R (what you already installed). An R package is like a toolbox: a collection of functions, data, and documentation that help you do specific tasks using R.
You’ll install each package (only once per system):
install.packages("tidyverse")
Warning: package 'tidyverse' is in use and will not be installed
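Because installation is needed only once per system, a script can check what is already installed before calling install.packages(). The sketch below is one possible pattern, not a required step; it uses two packages that ship with R so the check always passes, and you would swap in names such as "tidyverse" or "here" for real work.

```r
# Packages this session needs (base R packages here, as placeholders)
needed <- c("stats", "utils")

# requireNamespace() returns TRUE when a package is installed,
# without attaching it to the session
installed  <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

# Install only what is missing; loading still happens with library()
if (length(to_install) > 0) {
  install.packages(to_install)
}
```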
You’ll load each package (every time you use it):
library(tidyverse)
Support
Some help videos and further explanation:
Quarto
Quarto
The tool you’ll use to create reproducible computational documents. Every assignment you hand in will be a Quarto document.
- Fully reproducible reports
- R code + narrative
You are likely familiar with word processors like MS Word or Google Docs. We will not be using these in this class. Instead, the words you would write in such a document, as well as your R code, will go into a Quarto document. You will render the document (more on what this means later) to get a document out that has your words, code, and the output of that code. Everything in one place, beautifully formatted!
R script
- great for learning, exploring, and tinkering.
- can be rerun without attention to formatting or markdown.
Quarto
- great for communicating analysis and results.
- combines narrative explanation with code output (results).
Documentation
In .R scripts and Quarto, you can document your code. Err on the side of over-documentation. Your future self will thank you. In .R scripts, the way to write a comment rather than a command is to put a pound sign (#) in front of the text.
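A short sketch of what commented code can look like. The variable names and values are made up for illustration.

```r
# Everything after a pound sign is a comment: R ignores it.
# Hypothetical weekly hours for three respondents
hours_worked <- c(43, 20, 38)

mean_hours <- mean(hours_worked)  # a comment can also follow code on the same line
mean_hours
```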
Tour: Quarto
Sit back and enjoy the show!
- YAML Ain’t Markup Language
- code chunk (+ code comments)
- narrative (heading levels)
- render
- run (line vs code chunk)
- source and visual
Tour recap: Quarto
Source: Dr. Mine Çetinkaya-Rundel
Tour recap: Quarto Code-chunks
- chunk labels are helpful for describing what the code is doing, for jumping between code cells in the editor, and for troubleshooting
- message: false hides any messages emitted by the code in your rendered document
Source: Dr. Mine Çetinkaya-Rundel
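For reference, here is a rough sketch of what the inside of an R code chunk in a .qmd file might look like with a label and message: false set. The label name and the code are made up for illustration. Chunk options are written as special `#|` comments at the top of the chunk, so R itself simply ignores them.

```r
#| label: demo-message
#| message: false

# This message appears when you run the line interactively,
# but message: false keeps it out of the rendered document
message("Loading data...")

total <- 2 + 2
total
```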
How will we use Quarto?
- Every code-along and milestone will be a Quarto document
- The scaffolding will decrease over the course
- You will create and submit a Quarto document for your research project
Getting Started
Create an RStudio Project
To create a new project in RStudio, click: File > New Project.
In the New Project wizard that pops up, select: New Directory, then New Project.
Name the project “SOC6302” and click: Create Project.
This will launch you into a new RStudio Project inside a new folder called “SOC6302”.
Your first code-along
Download and open code-along-01.qmd
Packages
We’ll use the following packages:
- here() (relative file paths)
- tidyverse() (data wrangling)
- gssr() (U.S. General Social Survey data)
- gssrdoc() (GSS documentation)
Install here() and tidyverse()
Let’s install the two packages that are available on CRAN.
Copy and paste the following code into your Console pane. Then hit enter.
install.packages("here")
Then, do the same to install the tidyverse package.
install.packages("tidyverse")
Install gssr() and gssrdoc()
# Install 'gssr' from the kjhealy r-universe repository
install.packages('gssr', repos =
  c('https://kjhealy.r-universe.dev', 'https://cloud.r-project.org'))
# Also recommended: install 'gssrdoc' as well
install.packages('gssrdoc', repos =
  c('https://kjhealy.r-universe.dev', 'https://cloud.r-project.org'))
Load the packages
library(here)
library(tidyverse)
library(gssr)
library(gssrdoc)
Environment
# software documentation
sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_Canada.utf8 LC_CTYPE=English_Canada.utf8
[3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Canada.utf8
time zone: America/Toronto
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gssrdoc_0.7.0 here_1.0.1 conflicted_1.2.0 summarytools_1.1.4
[5] flextable_0.9.7 kableExtra_1.4.0 labelled_2.14.1 haven_2.5.5
[9] gssr_0.7 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
[13] dplyr_1.1.4 purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
[17] tibble_3.3.0 ggplot2_3.5.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.52 htmlwidgets_1.6.4
[4] tzdb_0.5.0 vctrs_0.6.5 tools_4.5.1
[7] generics_0.1.4 curl_6.3.0 pacman_0.5.1
[10] pkgconfig_2.0.3 data.table_1.17.6 checkmate_2.3.2
[13] pryr_0.1.6 RColorBrewer_1.1-3 uuid_1.2-1
[16] lifecycle_1.0.4 compiler_4.5.1 farver_2.1.2
[19] rapportools_1.2 textshaping_1.0.1 codetools_0.2-20
[22] fontquiver_0.2.1 fontLiberation_0.1.0 htmltools_0.5.8.1
[25] yaml_2.3.10 pillar_1.11.0 MASS_7.3-65
[28] openssl_2.3.3 cachem_1.1.0 magick_2.8.7
[31] fontBitstreamVera_0.1.1 tidyselect_1.2.1 zip_2.3.3
[34] digest_0.6.37 stringi_1.8.7 reshape2_1.4.4
[37] pander_0.6.6 rprojroot_2.1.0 fastmap_1.2.0
[40] grid_4.5.1 cli_3.6.5 magrittr_2.0.3
[43] base64enc_0.1-3 withr_3.0.2 backports_1.5.0
[46] gdtools_0.4.1 scales_1.4.0 timechange_0.3.0
[49] rmarkdown_2.29 officer_0.6.7 matrixStats_1.5.0
[52] askpass_1.2.1 ragg_1.4.0 hms_1.1.3
[55] memoise_2.0.1 evaluate_1.0.4 knitr_1.50
[58] tcltk_4.5.1 viridisLite_0.4.2 rlang_1.1.6
[61] Rcpp_1.0.14 glue_1.8.0 xml2_1.3.8
[64] svglite_2.1.3 rstudioapi_0.17.1 jsonlite_2.0.0
[67] plyr_1.8.9 R6_2.6.1 systemfonts_1.2.3
[70] fs_1.6.6
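One possible way to keep this record with a project is to write the session information to a text file. This is a sketch only: the file name is hypothetical, and tempdir() is used so the example runs anywhere.

```r
# Write the session information to a text file so it travels with the project.
# tempdir() keeps the example self-contained; in a project you might point
# this at your own outputs folder instead.
info_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# The first line of the file typically records the R version
readLines(info_file, n = 1)
```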
Project structure
Let’s set up your project structure using the here() package.
here()
First, let’s establish our project directory.
# shows the file path to the root of the project
here()
Next, we’ll create folders within our project.
Example folder structure
Research Projects
project/
  data/
  scripts/
  outputs/
  figures/
SOC6302
SOC6302/
  data/
  code-alongs/
  milestones/
  project/
    data/
    scripts/
    outputs/
Create a folder structure
using here() and dir.create()
# Create base folders
dir.create(here("data"), recursive = TRUE)
dir.create(here("code-alongs"), recursive = TRUE)
dir.create(here("milestones"), recursive = TRUE)
dir.create(here("project"), recursive = TRUE)
Create sub-folders
using here() and dir.create()
# Create project sub-folders
dir.create(here("project", "data"), recursive = TRUE)
dir.create(here("project", "scripts"), recursive = TRUE)
dir.create(here("project", "outputs"), recursive = TRUE)
Check your work
List the folders and files in the project folder and its sub-folders.
# Your SOC6302 class folder
list.files(path = here())
# Your "Project" sub-folder
list.files(path = here("project"))
Save code-along
Save this code-along in your newly created “code-alongs” sub-folder.
There’s no command in the R console to save scripts or Quarto files; use the editor’s File > Save As or Ctrl+S.
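Rerunning dir.create() on a folder that already exists produces a warning. If you expect to rerun the folder-creation steps, a small helper like the hypothetical `ensure_dir()` below can skip folders that are already there. This is a sketch under stated assumptions: it uses tempdir() in place of the project root so it runs anywhere, whereas inside an RStudio project you would build the paths with here().

```r
# Create a folder only if it does not exist yet, so reruns stay quiet
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# tempdir() stands in for the project root in this example
root <- file.path(tempdir(), "SOC6302-demo")
folders <- file.path(root, c(
  "data", "code-alongs", "milestones",
  file.path("project", c("data", "scripts", "outputs"))
))

# Running this twice is safe: existing folders are skipped
invisible(lapply(folders, ensure_dir))
invisible(lapply(folders, ensure_dir))

list.files(root)
```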
Meet your data
We’re going to use data from the U.S. General Social Survey (GSS).
Load your data
# Load the data (will appear in your Global Environment pane)
data(gss_all)
# Preview the data table, which is automatically named gss_all
gss_all
# A tibble: 75,699 × 6,867
year id wrkstat hrs1 hrs2 evwork occ prestige
<dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+lb>
1 1972 1 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 205 50
2 1972 2 5 [retire… NA(i) [iap] NA(i) [iap] 1 [yes] 441 45
3 1972 3 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 270 44
4 1972 4 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 1 57
5 1972 5 7 [keepin… NA(i) [iap] NA(i) [iap] 1 [yes] 385 40
6 1972 6 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 281 49
7 1972 7 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 522 41
8 1972 8 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 314 36
9 1972 9 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 912 26
10 1972 10 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 984 18
# ℹ 75,689 more rows
# ℹ 6,859 more variables: wrkslf <dbl+lbl>, wrkgovt <dbl+lbl>,
# commute <dbl+lbl>, industry <dbl+lbl>, occ80 <dbl+lbl>, prestg80 <dbl+lbl>,
# indus80 <dbl+lbl>, indus07 <dbl+lbl>, occonet <dbl+lbl>, found <dbl+lbl>,
# occ10 <dbl+lbl>, occindv <dbl+lbl>, occstatus <dbl+lbl>, occtag <dbl+lbl>,
# prestg10 <dbl+lbl>, prestg105plus <dbl+lbl>, indus10 <dbl+lbl>,
# indstatus <dbl+lbl>, indtag <dbl+lbl>, marital <dbl+lbl>, …
A “tibble” is the tidyverse’s version of a data frame: data organized in structured, clear rows and columns. “75,699 × 6,867” means the dataset contains 75,699 rows and 6,867 columns. Commonly, in social sciences, rows are referred to as “observations” and columns as “variables.” In our case, there are 75,699 observations (e.g., respondents) and 6,867 variables.
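You can also ask R for these dimensions directly. The sketch below uses a tiny made-up data frame rather than the GSS data, but the same functions work on gss_all.

```r
# A toy data frame: 3 observations (rows) and 2 variables (columns)
toy <- data.frame(
  id   = c(1, 2, 3),
  year = c(1972, 1972, 1972)
)

dim(toy)   # rows, then columns
nrow(toy)  # number of observations
ncol(toy)  # number of variables
```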
Load GSS 2024
# Get the data only for the 2024 survey respondents
gss24 <- gss_get_yr(2024)
Fetching: https://gss.norc.org/documents/stata/2024_stata.zip
# look at the first 6 rows of the dataframe
head(gss24)
# A tibble: 6 × 639
year id wrkstat hrs1 hrs2 evwork marital martype
<dbl+lb> <dbl> <dbl+l> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+lbl>
1 2024 1 1 [wor… 43 NA(i) [iap] NA(i) [iap] 5 [nev… NA(i) [iap]
2 2024 2 5 [ret… NA(i) [iap] NA(i) [iap] 1 [yes] 5 [nev… NA(i) [iap]
3 2024 3 5 [ret… NA(i) [iap] NA(i) [iap] 1 [yes] 1 [mar… 1 [mar…
4 2024 4 2 [wor… 20 NA(i) [iap] NA(i) [iap] 5 [nev… NA(i) [iap]
5 2024 5 5 [ret… NA(i) [iap] NA(i) [iap] 1 [yes] 3 [div… NA(i) [iap]
6 2024 6 4 [une… NA(i) [iap] NA(i) [iap] NA(i) [iap] 1 [mar… 1 [mar…
# ℹ 631 more variables: divorce <dbl+lbl>, widowed <dbl+lbl>,
# spwrksta <dbl+lbl>, sphrs1 <dbl+lbl>, sphrs2 <dbl+lbl>, spevwork <dbl+lbl>,
# cowrksta <dbl+lbl>, coevwork <dbl+lbl>, cohrs1 <dbl+lbl>, cohrs2 <dbl+lbl>,
# sibs <dbl+lbl>, childs <dbl+lbl>, age <dbl+lbl>, educ <dbl+lbl>,
# speduc <dbl+lbl>, coeduc <dbl+lbl>, codeg <dbl+lbl>, degree <dbl+lbl>,
# padeg <dbl+lbl>, madeg <dbl+lbl>, spdeg <dbl+lbl>, sex <dbl+lbl>,
# race <dbl+lbl>, res16 <dbl+lbl>, reg16 <dbl+lbl>, mobile16 <dbl+lbl>, …
Browse dataframe
With your mouse, go to the Environment pane (upper-right) and click on the “gss24” object. It opens in a viewer tab where you can browse through it.
This is often a good idea to get a first feel for the data, but only if your dataset is relatively small.
Author Naomi Wolf wrote a book about the treatment of homosexuality in 19th-century England. She reported that it was punishable by death because “sodomy” charges were followed by “death recorded” in the judicial logs. However, “sodomy” referred to a host of other sexual offenses as well, and “death recorded” meant the opposite of what she assumed: a death sentence was entered in the record but not carried out.
Paul Dolan wrote a book about how marriage makes you miserable. A key part of his evidence was that in the American Time Use Survey, married people surveyed about their happiness when the spouse was present reported being happy, but when the spouse was absent they reported being unhappy. Covering up unhappiness in front of their spouse, truly miserable! But while Dolan read spouse present/absent as “spouse is in the room during questioning,” it actually meant “spouse together/separated.” He was comparing couples who were together with couples who were separated!
Codebook
The GSS documentation is available online in .pdf form.
The .pdfs will be useful for general overviews.
For specific variable information, it will be helpful to use the documentation you’ll load into RStudio.
# Load the codebook
data(gss_dict)
Names
To see the variables available in the dataset, use the names() command.
names(gss_all)
Variable documentation
For information about a specific GSS variable,
type ?varname at the console.
In the output pane, the Help tab will show the variable documentation.
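Before you can look up a variable with ?varname, you need its name, and names() on the full GSS returns thousands of them. One option is to search the names by pattern. The sketch below uses a toy data frame that borrows a few GSS variable names; the values are made up.

```r
# Toy data frame reusing a few GSS variable names; the values are made up
toy <- data.frame(wrkstat = 1, hrs1 = 43, wrkslf = 2, marital = 5)

# Keep only the variable names that contain "wrk"
wrk_vars <- grep("wrk", names(toy), value = TRUE)
wrk_vars
```

The same call with names(gss_all) in place of names(toy) would search the real dataset.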
Variable documentation example
meovrwrk {gssrdoc} R Documentation
Men hurt family when focus on work too much
Description
meovrwrk
Details
Question 1297. And, do you agree or disagree: c. Family life often suffers because men concentrate too much on their work.
Overview
For further details see the official GSS documentation.
Counts by year:
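| year | iap | agree | can't choose | disagree | neither agree nor disagree | no answer | strongly agree | strongly disagree | skipped on web | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| 1972 | 1613 | - | - | - | - | - | - | - | - | 1613 |
| 1973 | 1504 | - | - | - | - | - | - | - | - | 1504 |
| 1974 | 1484 | - | - | - | - | - | - | - | - | 1484 |
| 1975 | 1490 | - | - | - | - | - | - | - | - | 1490 |
| 1976 | 1499 | - | - | - | - | - | - | - | - | 1499 |
| 1977 | 1530 | - | - | - | - | - | - | - | - | 1530 |
| 1978 | 1532 | - | - | - | - | - | - | - | - | 1532 |
| 1980 | 1468 | - | - | - | - | - | - | - | - | 1468 |
| 1982 | 1860 | - | - | - | - | - | - | - | - | 1860 |
| 1983 | 1599 | - | - | - | - | - | - | - | - | 1599 |
| 1984 | 1473 | - | - | - | - | - | - | - | - | 1473 |
| 1985 | 1534 | - | - | - | - | - | - | - | - | 1534 |
| 1986 | 1470 | - | - | - | - | - | - | - | - | 1470 |
| 1987 | 1819 | - | - | - | - | - | - | - | - | 1819 |
| 1988 | 1481 | - | - | - | - | - | - | - | - | 1481 |
| 1989 | 1537 | - | - | - | - | - | - | - | - | 1537 |
| 1990 | 1372 | - | - | - | - | - | - | - | - | 1372 |
| 1991 | 1517 | - | - | - | - | - | - | - | - | 1517 |
| 1993 | 1606 | - | - | - | - | - | - | - | - | 1606 |
| 1994 | 1545 | 695 | 33 | 243 | 286 | 27 | 122 | 41 | - | 2992 |
| 1996 | 1444 | 825 | 16 | 198 | 169 | 1 | 230 | 21 | - | 2904 |
| 1998 | 2832 | - | - | - | - | - | - | - | - | 2832 |
| 2000 | 940 | 877 | 43 | 361 | 331 | 22 | 209 | 34 | - | 2817 |
| 2002 | 1857 | 415 | 6 | 264 | 108 | - | 99 | 16 | - | 2765 |
| 2004 | 1906 | 460 | 4 | 188 | 135 | - | 94 | 25 | - | 2812 |
| 2006 | 2518 | 945 | 14 | 477 | 304 | 1 | 208 | 43 | - | 4510 |
| 2008 | 694 | 653 | 12 | 310 | 161 | - | 143 | 50 | - | 2023 |
| 2010 | 614 | 662 | 6 | 388 | 192 | 3 | 122 | 57 | - | 2044 |
| 2012 | 672 | 558 | 11 | 382 | 170 | - | 130 | 51 | - | 1974 |
| 2014 | 863 | 702 | 7 | 479 | 234 | 1 | 176 | 76 | - | 2538 |
| 2016 | 979 | 819 | 9 | 536 | 257 | - | 171 | 96 | - | 2867 |
| 2018 | 789 | 644 | 11 | 475 | 220 | 2 | 134 | 73 | - | 2348 |
| 2021 | 1315 | 886 | 1 | 487 | 1001 | - | 202 | 138 | 2 | 4032 |
| 2022 | 1168 | 885 | 15 | 537 | 618 | 1 | 201 | 117 | 2 | 3544 |
| 2024 | 1126 | 787 | 19 | 481 | 611 | - | 195 | 89 | 1 | 3309 |
| Total | 50650 | 10813 | 207 | 5806 | 4797 | 58 | 2436 | 927 | 5 | 75699 |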
Values
1 strongly agree
2 agree
3 neither agree nor disagree
4 disagree
5 strongly disagree
NA(d) can't choose
NA(i) iap
NA(j) I don't have a job
NA(m) dk, na, iap
NA(n) no answer
NA(p) not imputable
NA(r) refused
NA(s) skipped on web
NA(u) uncodeable
NA(x) not available in this release
NA(y) not available in this year
NA(z) see codebook
Source
General Social Survey https://gss.norc.org
[Package gssrdoc version 0.7.0 Index]
Variable name: meovrwrk
Variable label: Men hurt family when focus on work too much
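1994 was the first year this question was asked (the survey itself began in 1972).
In 1994, 695 respondents agreed with the statement.
iap: inapplicable (the respondent was not asked the question); treated as missing.
Values: the key that maps numeric codes to response categories (e.g., 1 = strongly agree).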
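To see why the iap category matters, here is a worked example using the 1994 row of the counts table above. It is a sketch of the arithmetic only, with the counts typed in by hand; the short names such as `cant_choose` are just labels chosen for this example.

```r
# 1994 counts copied from the documentation table above
counts_1994 <- c(
  iap = 1545, agree = 695, cant_choose = 33, disagree = 243,
  neither = 286, no_answer = 27, strongly_agree = 122, strongly_disagree = 41
)

# Substantive answers only (drops iap, can't choose, and no answer)
valid <- counts_1994[c("strongly_agree", "agree", "neither",
                       "disagree", "strongly_disagree")]

share_all   <- counts_1994[["agree"]] / sum(counts_1994)  # iap left in
share_valid <- counts_1994[["agree"]] / sum(valid)        # missing removed

round(100 * c(all_rows = share_all, valid_only = share_valid), 1)
```

With the missing categories left in, about 23% of the 1994 rows are “agree”; among respondents who gave a substantive answer, the figure is about 50%.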
We can find which years one or more variables were asked with the gss_which_years() function.
gss_which_years(gss_all, meovrwrk)
# A tibble: 35 × 2
year meovrwrk
<dbl+lbl> <lgl>
1 1972 FALSE
2 1973 FALSE
3 1974 FALSE
4 1975 FALSE
5 1976 FALSE
6 1977 FALSE
7 1978 FALSE
8 1980 FALSE
9 1982 FALSE
10 1983 FALSE
# ℹ 25 more rows
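Output like this is easier to use once it is reduced to the years marked TRUE. The sketch below does not call gssr; it uses a small hand-built stand-in whose TRUE/FALSE values are assumptions based on the counts table in the variable documentation (responses recorded in 1994 and 1996, none in 1991, 1993, or 1998).

```r
# Hand-built stand-in for the function's output (not a call to gssr itself)
asked <- data.frame(
  year     = c(1991, 1993, 1994, 1996, 1998),
  meovrwrk = c(FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Logical indexing keeps only the years where the question was asked
years_asked <- asked$year[asked$meovrwrk]
years_asked
```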